Language-independent Informative Topic Segmentation
Author
Abstract
In this paper, we present an innovative topic segmentation system based on a new informative similarity measure that takes word co-occurrence into account, thereby avoiding dependence on existing linguistic resources such as electronic dictionaries or lexico-semantic databases (thesauri or ontologies). Topic Segmentation is the task of breaking documents into topically coherent multi-paragraph subparts, and it has been used extensively in Information Retrieval and Text Summarization. In particular, our architecture proposes a language-independent Topic Segmentation system that solves three main problems evidenced by previous research: systems based solely on lexical repetition show reliability problems; systems based on lexical cohesion depend on existing linguistic resources that are usually available only for dominant languages and therefore do not apply to less-favored languages; and some systems require previously harvested training data.
Similar resources
Prosody-based automatic segmentation of speech into sentences and topics
A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and m...
Prosody-Based Automatic Segmentation of Speech into Sentences and Topics (v1, 27 Jun 2000)
A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and...
Statistical methods for topic segmentation
Automatic Topic Segmentation is an important technology for multimedia archival and retrieval systems. In this paper we present an algorithm for topic segmentation which uses a combination of machine learning, statistical natural language processing, and information retrieval techniques. The performance of this algorithm is measured by considering the misses and false alarms on a manually segme...
Discovering Topic Boundaries for Text Summarization Based on Word Co-occurrence
Topic Segmentation is the task of breaking documents into topically coherent multi-paragraph subparts. In particular, Topic Segmentation is extensively used in Text Summarization to provide more coherent results by taking into account raw document structure. However, most methodologies are based on lexical repetition, which shows evident reliability problems, or rely on harvesting linguistic resourc...
Statistical Physics for Natural Language Processing
In this paper we study the Enertex model that has been applied to fundamental tasks in Natural Language Processing (NLP) including automatic document summarization and topic segmentation. The model is language independent. It is based on the intuitive concept of Textual Energy, inspired by Neural Networks and Statistical Physics of magnetic systems. It can be implemented using simple matrix ope...